1 Introduction

Before embarking on developing statistical models and generating predictions, it is essential to understand your data. This is typically done using conventional numerical and graphical methods. John Tukey (Tukey, 1977) advocated the practice of exploratory data analysis (EDA) as a critical part of the scientific process.

“No catalog of techniques can convey a willingness to look for what can be seen, whether or not anticipated. Yet this is at the heart of exploratory data analysis. The graph paper and transparencies are there, not as a technique, but rather as a recognition that the picture examining eye is the best finder we have of the wholly unanticipated.”

Fortunately, we can dispense with the graph paper and transparencies and use software that makes routine work of developing the ‘pictures’ (i.e., graphical output) and descriptive statistics needed to explore our data.

Descriptive statistics include:

  • Mean - arithmetic average
  • Median - middle value
  • Mode - most frequent value
  • Standard Deviation - variation about the mean
  • Interquartile Range - range encompasses 50% of the values
  • Kurtosis - peakedness of the data distribution
  • Skewness - symmetry of the data distribution

Graphical methods include:

  • Histogram - a bar plot where each bar represents the frequency of observations for a given range of values
  • Density estimation - an estimation of the frequency distribution based on the sample data
  • Quantile-quantile plot - a plot of the actual data values against a normal distribution
  • Box plots - a visual representation of median, quartiles, symmetry, skewness, and outliers
  • Scatter plots - a graphical display of one variable plotted on the x axis and another on the y axis
  • Radial plots - plots formatted for the representation of circular data

1.1 Objectives

  • Determine Low, RV, and High values (e.g. mean, median, etc…) for component and horizon data.
  • Visualize your data to identify trends, using Rs base and ggplot2 graphics.
  • Learn how to use the soilReports package to generate Rmarkdown reports.
  • Learn how to properly transform your data for analysis.

2 Data Inspection

Before you start an EDA, you should inspect your data and correct all typos and blatent errors. EDA can then be used to identify additional errors such as outliers and help you determine appropriate statistical analyses. For this chapter we’ll use the loafercreek dataset from the CA630 Soil Survey Area.

library(soilDB)

# Load from the the loakercreek dataset
data("loafercreek") 

# Generalized the horizon designations
n <- c("A",
       "BAt",
       "Bt1",
       "Bt2",
       "Cr",
       "R")
# REGEX rules
p <- c("A",
       "BA|AB",
       "Bt|Bw",
       "Bt3|Bt4|2B|C",
       "Cr",
       "R")

# Compute genhz labels and add to loafercreek dataset
loafercreek$genhz <- generalize.hz(loafercreek$hzname, n, p)

# Extract the horizon table
h <- horizons(loafercreek) 

# Extract the site table
s <- site(loafercreek)

# Examine the matching of pairing of the genhz label to the hznames
with(h, table(genhz, hzname))
##           hzname
## genhz      2BCt 2Bt2 2Bt3 2Bt4 2CB 2Cr 2Crt 2R  A AB ABt Ad Ap  B BA BAt
##   A           0    0    0    0   0   0    0  0 62  0   0  1  1  0  0   0
##   BAt         0    0    0    0   0   0    0  0  0  1   0  0  0  0 11   3
##   Bt1         0    0    0    0   0   0    0  0  0  0   4  0  0  0  0   0
##   Bt2         1    1    5    4   1   0    0  0  0  0   0  0  0  0  0   0
##   Cr          0    0    0    0   0   3    1  0  0  0   0  0  0  0  0   0
##   R           0    0    0    0   0   0    0  2  0  0   0  0  0  0  0   0
##   not-used    0    0    0    0   0   0    0  0  0  0   0  0  0  1  0   0
##           hzname
## genhz      BCt Bt Bt1 Bt2 Bt3 Bt4 Bw  C CBt Cr Crt Oi  R Rt
##   A          0  0   0   0   0   0  0  0   0  0   0  0  0  0
##   BAt        0  0   0   0   0   0  0  0   0  0   0  0  0  0
##   Bt1        0  5  59  58   0   0  3  0   0  0   0  0  0  0
##   Bt2        5  0   0   0  27   3  0  2   3  0   0  0  0  0
##   Cr         0  0   0   0   0   0  0  0   0 35  18  0  0  0
##   R          0  0   0   0   0   0  0  0   0  0   0  0 22  2
##   not-used   0  0   0   0   0   0  0  0   0  0   0 21  0  0

As noted in Chapter 1, a visual examination of the raw data is possible by looking at the R object:

View(h)

# or by clicking on the dataset in the environment tab

This view is fine for a small dataset, but can be cumbersome for larger ones. The summary() function can be used to quickly summarize a dataset however, even for our small example dataset, the output can be voluminous. Therefore in the interest of saving space we’ll only look at a sample of columns.

vars <- c("genhz", "clay", "total_frags_pct", "phfield", "effervescence")
summary(h[vars])
##       genhz          clay       total_frags_pct    phfield     
##  A       : 64   Min.   :10.00   Min.   : 0.00   Min.   :4.900  
##  BAt     : 15   1st Qu.:17.00   1st Qu.: 0.00   1st Qu.:6.000  
##  Bt1     :129   Median :21.00   Median : 5.00   Median :6.200  
##  Bt2     : 52   Mean   :22.05   Mean   :13.01   Mean   :6.135  
##  Cr      : 57   3rd Qu.:26.00   3rd Qu.:20.00   3rd Qu.:6.500  
##  R       : 26   Max.   :48.00   Max.   :87.00   Max.   :6.900  
##  not-used: 22   NA's   :110                     NA's   :204    
##  effervescence     
##  Length:365        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
# or

# sub <- subset(h, select = vars)
# summary(sub)

The summary() function is known as a generic R function. It will return a preprogrammed summary for any R object. Because h is a data frame, we get a summary of each column. Factors will be summarized by their frequency (i.e., number of observations), while numeric or integer variables will print out a five number summary, and characters simply print their length. The number of missing observations for any variable will also be printed if they are present. If any of these metrics look unfamiliar to you, don’t worry we’ll cover them shortly.

When you do have missing data and the function you want to run will not run with missing values, the following options are available:

  1. Exclude all rows or columns that contain missing values using the function na.exclude(), such as h2 <- na.exclude(h). However this can be wasteful because it removes all rows (e.g., horizons), regardless if the row only has 1 missing value. Instead it’s sometimes best to create a temporary copy of the variable in question and then remove the missing variables, such as clay <- na.exclude(h$clay).
  2. Replace missing values with another value, such as zero, a global constant, or the mean or median value for that column, such as h$clay <- ifelse(is.na(h$clay), 0, h$clay) # or h[is.na(h$clay), ] <- 0.
  3. Read the help file for the function you’re attempting to use. Many functions have additional arguments for dealing with missing values, such as na.rm.

A quick check for typos would be to examine the list of levels for a factor or character, such as:

# just for factors
levels(h$genhz)
## [1] "A"        "BAt"      "Bt1"      "Bt2"      "Cr"       "R"       
## [7] "not-used"
# for characters and factors
# unique(h$hzname) 

# sort unique hznames
sort(unique(h$hzname)) 
##  [1] "2BCt" "2Bt2" "2Bt3" "2Bt4" "2CB"  "2Cr"  "2Crt" "2R"   "A"    "AB"  
## [11] "ABt"  "Ad"   "Ap"   "B"    "BA"   "BAt"  "BCt"  "Bt"   "Bt1"  "Bt2" 
## [21] "Bt3"  "Bt4"  "Bw"   "C"    "CBt"  "Cr"   "Crt"  "Oi"   "R"    "Rt"

If the unique() function returned typos such as “BT” or “B t”, you could either fix your original dataset or you could make an adjustment in R, such as:

h$hzname <- ifelse(h$hzname == "BT", "Bt", h$hzname)

# or

# h$hzname[h$hzname == "BT"] <- "Bt"

# or as a last resort we could manually edit the spreadsheet in R

# edit(h)

Typo errors such as these are a common problem with old pedon data in NASIS.

2.1 Exercise: fetch and inspect

  • Load the gopheridge dataset found within the soilDB package or use your own data (highly encouraged) and inspect the dataset
  • Apply the generalized horizon rules below or develop your own, see the following job-aid
  • Summarize the depths, genhz, texture class, sand, and fine gravel.
  • Show your work and submit the results to your mentor.
# gopheridge rules
n <- c('A', 'Bt1', 'Bt2', 'Bt3','Cr','R')
p <- c('^A|BA$', 'Bt1|Bw','Bt$|Bt2', 'Bt3|CBt$|BCt','Cr','R')

3 Examining Distributions

Now that we’ve checked for missing values and typos and made corrections, we can graphically examine the sample data distribution of our integer and numeric data. Frequency distributions are useful because they can help us visualize the center (e.g., RV) and spread or dispersion (e.g., low and high) of our data. Typically in introductory statistics the normal (i.e., Gaussian) distribution is emphasized.

3.1 The normal distribution

What is a normal distribution and why should you care? Many statistical methods are based on the properties of a normal distribution. Applying certain methods to data that are not normally distributed can give misleading or incorrect results. Most methods that assume normality are robust enough for all data except the very abnormal. This section is not meant to be a recipe for decision making, but more an extension of tools available to help you examine your data and proceed accordingly. The impact of normality is most commonly seen for parameters used by pedologists for documenting the ranges of a variable (i.e., Low, RV and High values). Often a rule-of thumb similar to: “two standard deviations” is used to define the low and high values of a variable. This is fine if the data are normally distributed. However, if the data are skewed, using the standard deviation as a parameter does not provide useful information of the data distribution. The quantitative measures of Kurtosis (peakedness) and Skewness (symmetry) can be used to assist in accessing normality and can be found in the fBasics package, but Webster (2001) cautions against using significance tests for assessing normality. The preceding sections and chapters will demonstrate various methods to cope with alternative distributions.

A Gaussian distribution is often referred to as “Bell Curve”, and has the following properties (Lane):

  1. Gaussian distributions are symmetric around their mean
  2. The mean, median, and mode of a Gaussian distribution are equal
  3. The area under the curve is equal to 1.0
  4. Gaussian distributions are denser in the center and less dense in the tails
  5. Gaussian distributions are defined by two parameters, the mean and the standard deviation
  6. 68% of the area under the curve is within one standard deviation of the mean
  7. Approximately 95% of the area of a Gaussian distribution is within two standard deviations of the mean

Viewing a histogram or density plot of your data provides a quick visual reference for determining normality as discussed in section 4.1. Distributions are typically normal, Bimodal or Skewed:

Figure 2. Sample histograms

Occasionally distributions are Uniform, or nearly so:

Figure 3. Uniform distribution

3.2 Histograms and density curves

Next we’ll review the ggplot2 geom_histogram() and geom_density() functions which plot a histogram and density curve respectively.

library(ggplot2)

ggplot(h, aes(x = clay)) +
  geom_histogram()

# or

ggplot(h, aes(x = clay)) +
  geom_histogram() +
  facet_wrap(~ genhz)

# or using the graphics package

# hist(h$clay, col = "grey")

Figure 4. Histogram

Since histograms are dependent on the number of bins, for small datasets they’re not the best method of determining the shape of a distribution. A density estimation, also known as a Kernel density plot, generally provides a better visualization of the shape of the distribution in comparison to the histogram.

ggplot(h, aes(x = clay, y = ..density..)) +
  geom_histogram() +
  geom_density() +
  facet_wrap(~ genhz)

ggplot(h, aes(x = clay)) +
  geom_density()

# or using the graphics package

# test <- density(h$clay, na.rm = TRUE)
# plot(test) 

Figure 5. The Kernel density plot depicts a smoothed line of the distribution

Compared to the histogram where the y-axis represents the number or percent (i.e., frequency) of observations, the y-axis for the density plot represents the probability of observing any given value, such that the area under the curve equals one. Notice how the histogram of clay seems to emphasis the long right tail of values, which we might question as to whether we have a normal distribution. One curious feature of the density curve is the hint of a bimodal distribution. Given that our sample includes a mixture of surface and subsurface horizons, we may have two different populations. However considering how much the two distributions overlap, it seems impractical to separate them in this instance.

3.3 Quantile comparison plots (QQplot)

A QQplot is a plot of the actual data values against a normal distribution (which has a mean of 0 and standard deviation of 1).

A QQplot of sand content may be made for the sample dataset as:

ggplot(h, aes(sample = clay)) + 
  geom_qq() +
  geom_qq_line() +
  ggtitle("Normal Q-Q Plot for Clay")

ggplot(h, aes(sample = total_frags_pct)) + 
  geom_qq() +
  geom_qq_line() +
  ggtitle("Normal Q-Q Plot for Frags")

Figure 6. QQplot

The line represents the quantiles of a normal distribution. If the data set is perfectly normal, the data points will fall along the line. Overall this plot shows that our clay example is more or less normally distributed. However the second plot again shows that our rock fragments are far from normally distributed.

A more detailed explanation of QQplots may be found on Wikipedia:
https://en.wikipedia.org/wiki/QQ_plot

4 Measures of Central Tendency

These measures are used to determine the mid-point of the range of observed values. In NASIS speak this should ideally be equivalent to the representative value (RV) for numeric and integer data. The mean and median are the most commonly used measures for our purposes.

Mean - is the arithmetic average all are familiar with, formally expressed as:R GUI image which sums ( \(\sum\) ) all the X values in the sample and divides by the number (n) of samples. It is assumed that all references in this document refer to samples rather than a population.

The mean clay content from the loafercreek dataset may be determined:

clay <- na.exclude(h$clay) # first remove missing values and create a new vector

mean(clay)
## [1] 22.04863
# or use the additional na.rm argument

mean(h$clay, na.rm = TRUE)
## [1] 22.04863

To determine the mean by a group or category, use the aggregate command:

aggregate(clay ~ genhz, data = h, mean)
##   genhz     clay
## 1     A 15.47903
## 2   BAt 18.93333
## 3   Bt1 22.94603
## 4   Bt2 28.60577

Median is the middle measurement of a sample set, and as such is a more robust estimate of central tendency than the mean. This is known as the middle or 50th quantile, meaning there are an equal number of samples with values less than and greater than the median. For example, assuming there are 21 samples, sorted in ascending order, the median would be the 11th sample.

The median from the sample dataset may be determined:

median(clay)
## [1] 21

To determine the median by group or category, use the aggregate command again:

aggregate(clay ~ genhz, data = h, median)
##   genhz clay
## 1     A 15.0
## 2   BAt 18.0
## 3   Bt1 22.5
## 4   Bt2 28.0
# or we could use the summary() function to get both the mean and median

aggregate(clay ~ genhz, data = h, summary)
##   genhz clay.Min. clay.1st Qu. clay.Median clay.Mean clay.3rd Qu.
## 1     A  10.00000     13.00000    15.00000  15.47903     16.92500
## 2   BAt  14.00000     16.50000    18.00000  18.93333     20.00000
## 3   Bt1  12.00000     19.00000    22.50000  22.94603     26.00000
## 4   Bt2  10.00000     26.00000    28.00000  28.60577     31.25000
##   clay.Max.
## 1  23.00000
## 2  27.00000
## 3  43.00000
## 4  48.00000

In this example the mean and median are only slightly different, so we can safely assume that we have a normal distribution. However many soil variables often have a non-normal distribution. For example, let’s look at graphical examination of the mean vs. median for clay and rock fragments:

amean = mean(clay)
amed <- median(clay)

ggplot(h, aes(x = clay)) +
  geom_density() +
  geom_vline(xintercept = amean, col = "blue") +
  geom_vline(xintercept = amed, col = "orange") +
  ggtitle("Comparison of the Mean (Blue) and Median (Orange) for Clay Content")

amean <- mean(h$total_frags_pct)
amed <- median(h$total_frags_pct)

ggplot(h, aes(x = total_frags_pct)) +
  geom_density() +
  geom_vline(xintercept = amean, col = "blue") +
  geom_vline(xintercept = amed, col = "orange") +
  ggtitle("Comparison of the Mean (Blue) and Median (Orange) for Rock Fragments")

Figure 7. Comparison of the mean vs median for clay and rock fragments.

The blue vertical line represents the breakpoint for the median and the orange represents the mean. The median is a more robust measure of central tendency compared to the mean. In order for the mean to be a useful measure, the data distribution must be approximately normal. The further the data departs from normality, the less meaningful the mean becomes. The median always represents the same thing independent of the data distribution, namely, 50% of the samples are below and 50% are above the median. The example from Figure 7 for clay again indicates that distribution is approximately normal. However for rock fragments, we see a long tailed distribution (e.g., skewed). Using the mean in this instance would overestimate the rock fragments. Although in this instance the difference between the mean and median is only 8 percent.

Mode - is the most frequent measurement in the sample. The use of mode is typically reserved for factors, which we will discuss shortly. One issue with using the mode for numeric data is that the data need to be rounded to the level of desired precision. In following example you can see where it looks like someone entered the laboratory data into the pedon horizon data. It’s hard to tell from the example, but the first column in each sequence shows the values and the following columns show the number occurrences for that value.

table(h$clay) # show a frequency table
## 
##               10               11               12               13 
##                3                3                7                8 
##               14               15               16 16.7000007629395 
##               10               10               18                1 
##               17               18               19 19.6000003814697 
##               13               16               13                1 
##               20               21             21.5               22 
##               15               12                1               10 
##               23 23.6000003814697               24               25 
##                9                1               13               13 
##               26               27               28               29 
##               17                8               17                9 
##               30               31               32               33 
##                4                1                5                4 
##               34               35               36               38 
##                2                3                1                1 
##               39               41               42               43 
##                1                1                2                1 
##               48 
##                1
# we can fix the rounding error like so

# table(round(h$clay))

# or

# table(as.integer(h$clay)) 

sort(table(round(h$clay)), decreasing = TRUE)[1] # sort and select the 1st value, which will be the mode
## 16 
## 18

5 Measures of Dispersion

These are measures used to determine the spread of values around the mid-point. This is useful to determine if the samples are spread widely across the range of observations or concentrated near the mid-point. In NASIS speak these values might equate to the low (L) and high (H) values for numeric and integer data.

Variance is a positive value indicating deviation from the mean:

This is the square of the sum of the deviations from the mean, divided by the number of samples minus 1. It is commonly referred to as the sum of squares. As the deviation increases, the variance increases. Conversely, if there is no deviation, the variance will equal 0. As a squared value, variance is always positive. Variance is an important component for many statistical analyses including the most commonly referred to measure of dispersion, the standard deviation. Variance for the sample dataset is:

var(clay)
## [1] 45.53172

Standard Deviation is the square root of the variance:

The units of the standard deviation are the same as the units measured. From the formula you can see that the standard deviation is simply the square root of the variance. Standard deviation for the sample dataset is:

sd(clay)
## [1] 6.74772
# or

# sqrt(var(clay))

Coefficient of Variation (CV) is a relative (i.e., unitless) measure of standard deviation:R GUI image

CV is calculated by dividing the standard deviation by the mean and multiplying by 100. Since standard deviation varies in magnitude with the value of the mean, the CV is useful for comparing relative variation amongst different datasets. However Webster (2001) discourages using CV to compare different variables. Webster (2001) also stresses that CV is reserved for variables that have an absolute 0, like clay content. CV may be calculated for the sample dataset as:

cv <- sd(clay) / mean(clay) * 100
cv
## [1] 30.60381

Quantiles (a.k.a. Percentiles) - the percentile is the value that cuts off the first nth percent of the data values when sorted in ascending order.

The default for the quantile() function returns the min, 25th percentile, median or 50th percentile, 75th percentile, and max, known as the five number summary originally proposed by Tukey. Other probabilities however can be used. At present the 5th, 50th, and 95th are being proposed for determining the range in characteristics (RIC) for a given soil property.

quantile(clay)
##   0%  25%  50%  75% 100% 
##   10   17   21   26   48
# or

quantile(clay, c(0.05, 0.5, 0.95))
##   5%  50%  95% 
## 12.7 21.0 33.3

Thus, for the five number summary 25% of the observations fall between each of the intervals. Quantiles are a useful metric because they are largely unaffected by the distribution of the data, and have a simple interpretation.

Range is the difference between the highest and lowest measurement of a group. Using the sample data it may be determined as:

range(clay)
## [1] 10 48

which returns the minimum and maximum values observed, or:

diff(range(clay))
## [1] 38
# or

# max(clay) - min(clay)

which returns the value of the range

Interquartile Range (IQR) is the range from the upper (75%) quartile to the lower (25%) quartile. This represents 50% of the observations occurring in the mid-range of a sample. IQR is a robust measure of dispersion, unaffected by the distribution of data. In soil survey lingo you could consider the IQR to estimate the central concept of a soil property. IQR may be calculated for the sample dataset as:

IQR(clay)
## [1] 9
# or

# diff(quantile(clay, c(0.25, 0.75)))

5.1 Box plots

Box plots are a graphical representation of the five number summary, depicting quartiles, minimum, maximum and outliers (if present). Boxplots convey the shape of the data distribution, the presence of extreme values, and the ability to compare with other variables using the same scale, providing an excellent tool for screening data quality, determining thresholds for variables and developing working hypotheses.

The parts of the boxplot are shown in Figure 8. The “box” of the boxplot is defined as the 1st quartile, (Q1 in the figure) and the 3rd quartile, (Q3 in the figure). The median, or 2nd quartile, is the dark line in the box. The whiskers show data that is 1.5 * IQR above and below the 3rd and 1st quartile. Any data point that is beyond a whisker is considered an outlier.

That is not to say the points are in error, just that they are extreme compared to the rest of the dataset. However, you may want to evaluate these points to ensure that they are correct.

R GUI image

R GUI image

Figure 8. Boxplot description (Seltman, 2009)

A boxplot of sand content by horizon may be made for the sample dataset as:

ggplot(h, (aes(x = genhz, y = clay))) +
  geom_boxplot()

# or using the graphics package

# boxplot(clay ~ genhz, xlab = "Horizon", ylab="Clay (%)", data = h)

Figure 9. Box plot of clay by horizon

The xlab and ylab parameters control the titles of the x and y axis.

The above box plot shows a steady increase in clay content with depth. Notice the outliers in the box plots, identified by the individual circles.

Frequencies

To summarize factors and characters we can examine their frequency or number of observations. This is accomplished using the table() function.

table(h$genhz)
## 
##        A      BAt      Bt1      Bt2       Cr        R not-used 
##       64       15      129       52       57       26       22
# or

# summary(h$genhz)

This gives us a count of the number of observations for each horizon. If we want to see the comparison between two different factors or characters, we can include two variables.

table(h$genhz, h$hzname)
##           
##            2BCt 2Bt2 2Bt3 2Bt4 2CB 2Cr 2Crt 2R  A AB ABt Ad Ap  B BA BAt
##   A           0    0    0    0   0   0    0  0 62  0   0  1  1  0  0   0
##   BAt         0    0    0    0   0   0    0  0  0  1   0  0  0  0 11   3
##   Bt1         0    0    0    0   0   0    0  0  0  0   4  0  0  0  0   0
##   Bt2         1    1    5    4   1   0    0  0  0  0   0  0  0  0  0   0
##   Cr          0    0    0    0   0   3    1  0  0  0   0  0  0  0  0   0
##   R           0    0    0    0   0   0    0  2  0  0   0  0  0  0  0   0
##   not-used    0    0    0    0   0   0    0  0  0  0   0  0  0  1  0   0
##           
##            BCt Bt Bt1 Bt2 Bt3 Bt4 Bw  C CBt Cr Crt Oi  R Rt
##   A          0  0   0   0   0   0  0  0   0  0   0  0  0  0
##   BAt        0  0   0   0   0   0  0  0   0  0   0  0  0  0
##   Bt1        0  5  59  58   0   0  3  0   0  0   0  0  0  0
##   Bt2        5  0   0   0  27   3  0  2   3  0   0  0  0  0
##   Cr         0  0   0   0   0   0  0  0   0 35  18  0  0  0
##   R          0  0   0   0   0   0  0  0   0  0   0  0 22  2
##   not-used   0  0   0   0   0   0  0  0   0  0   0 21  0  0
table(h$genhz, h$texture_class)
##           
##            br  c cl  l mpm scl sic sicl sil sl spm
##   A         0  0  0 51   0   0   0    0   9  3   0
##   BAt       0  0  0 12   0   0   0    1   1  1   0
##   Bt1       0  1 20 81   0   3   1    4  17  0   0
##   Bt2       0  2 24 15   0   4   1    2   4  0   0
##   Cr       51  0  0  0   0   0   0    0   0  0   0
##   R        23  0  0  0   0   0   0    0   0  0   0
##   not-used  0  0  1  0   1   0   0    0   0  0  18

We can also easily add margin totals to the table or convert the table frequencies to proportions.

# append the table with row and column sums

addmargins(table(h$genhz, h$texture_class))
##           
##             br   c  cl   l mpm scl sic sicl sil  sl spm Sum
##   A          0   0   0  51   0   0   0    0   9   3   0  63
##   BAt        0   0   0  12   0   0   0    1   1   1   0  15
##   Bt1        0   1  20  81   0   3   1    4  17   0   0 127
##   Bt2        0   2  24  15   0   4   1    2   4   0   0  52
##   Cr        51   0   0   0   0   0   0    0   0   0   0  51
##   R         23   0   0   0   0   0   0    0   0   0   0  23
##   not-used   0   0   1   0   1   0   0    0   0   0  18  20
##   Sum       74   3  45 159   1   7   2    7  31   4  18 351
# calculate the proportions relative to the rows, margin = 1 calculates for rows, margin = 2 calculates for columns, margin = NULL calculates for total observations

round(prop.table(table(h$genhz, h$texture_class), margin = 1) * 100) 
##           
##             br   c  cl   l mpm scl sic sicl sil  sl spm
##   A          0   0   0  81   0   0   0    0  14   5   0
##   BAt        0   0   0  80   0   0   0    7   7   7   0
##   Bt1        0   1  16  64   0   2   1    3  13   0   0
##   Bt2        0   4  46  29   0   8   2    4   8   0   0
##   Cr       100   0   0   0   0   0   0    0   0   0   0
##   R        100   0   0   0   0   0   0    0   0   0   0
##   not-used   0   0   5   0   5   0   0    0   0   0  90

6 Measures of Association

6.1 Correlation matrix

A correlation matrix is a table of the calculated correlation coefficients of all variables. This provides a quantitative measure to guide the decision making process. The following will produce a correlation matrix for the sp4 dataset:

h$hzdepm <- with(h, (hzdepb + hzdept) / 2) # Compute the middle horizon depth

vars <- c("hzdepm", "clay", "sand", "total_frags_pct", "phfield")

round(cor(h[vars], use = "complete.obs"), 2)
##                 hzdepm  clay  sand total_frags_pct phfield
## hzdepm            1.00  0.60 -0.10            0.50   -0.01
## clay              0.60  1.00 -0.18            0.31    0.19
## sand             -0.10 -0.18  1.00           -0.03    0.10
## total_frags_pct   0.50  0.31 -0.03            1.00   -0.25
## phfield          -0.01  0.19  0.10           -0.25    1.00

As seen in the output, variables are perfectly correlated with themselves and have a correlation coefficient of 1.0. Negative values indicate a negative relationship between variables. What is considered highly correlated? A good rule of thumb is anything with a value of 0.7 or greater is considered highly correlated.

6.2 Scatterplots

Plotting points of one ratio or interval variable against another is a scatter plot. Plots can be produced for a single or multiple pairs of variables. Many independent variables are often under consideration in soil survey work. This is especially common when GIS is used, which offers the potential to correlate soil attributes with a large variety of raster datasets.

The purpose of a scatterplot is to see how one variable relates to another. With modeling in general the goal is parsimony (i.e., simple). The goal is to determine the fewest number of variables required to explain or describe a relationship. If two variables explain the same thing, i.e., they are highly correlated, only one variable is needed. The scatterplot provides a perfect visual reference for this.

Create a basic scatter plot using the loafercreek dataset.

ggplot(h, aes(x = clay, y = hzdepm)) +
  geom_point()

# or

# plot(clay ~ hzdepm, data = h)

Figure 10. Scatter Plot

This plots clay on the X axis and depth on the X axis. As shown in Figure 10, there is a moderate correlation between these variables.

The function below produces a scatterplot matrix for all the numeric variables in the dataset. This is a good command to use for determining rough linear correlations for continuous variables.

# Load the GGally package
library(GGally)

# Create a scatter plot matrix
vars <- c("hzdepm", "clay", "phfield", "total_frags_pct")

GGally::ggpairs(h[vars])

# or using base graphics

## Create a scatter plot matrix
# pairs(h[vars])

7 Transformations

Slope aspect and pH are two common variables warranting special consideration for pedologists.

7.1 pH

Since pH has a logarithmic distribution, the use of median and quantile ranges are the preferred measures when summarizing pH. Remember, pHs of 6 and 5 correspond to hydrogen ion concentrations of 0.000001 and 0.00001 respectively. The actual average is 5.26; -log10((0.000001 + 0.00001) / 2). If no conversions are made for pH, the mean and sd in the summary are considered the geometric mean and sd, not the arithmetic. The wider the pH range, the greater the difference between the geometric and arithmetic mean. The difference between the correct average of 5.26 and the incorrect of 5.5 is small, but proper handling of data types is a best practice.

If you have a table with pH values and wish to calculate the arithmetic mean using R, this example will work:

# arithematic mean
-log10(mean(1/10^-h$phfield, na.rm = TRUE)) 
## [1] -6.309127
# geometric mean
mean(h$phfield, na.rm = TRUE) 
## [1] 6.134783

If there is a need to create a surface of pH values, i.e., interpolate values from point observations, the operation of determining values at unknown points is analogous to determining an average and the use of hydrogen ion concentration would be the proper input.

If spatial interpolation is going to be performed, the following steps should be executed:

  1. transform pH to the actual H+ concentration
  2. interpolate
  3. back transform to log value

Here is a brief example for interpolating pH using common software:

  1. Assume a comma delimited text file with pH, and x and y coordinates named “Excel_ph2.csv”
  2. Open the file in Excel and it looks similar to this:
    R GUI image
  3. Format a column as numeric with ~15 decimals and a header named H_concentration
  4. Enter a formula in the first empty cell as: =(1/10^B2) * 1000000
  5. Drag the cell down to all empty records, which results in a transformed H+ concentration R GUI image

This is a workaround for ArcGIS, which truncates data that are extremely small like the H+ concentration for pH > 7.

  1. Bring the text file into ArcGIS as points using Make XY Event Layer
R GUI image

R GUI image

R GUI image

R GUI image

Opening the table for the Event layer:

R GUI image

R GUI image

  1. Interpolate using the interpolation method of choice - Spline will be used in this example

R GUI image
R GUI image

  1. The resulting values correspond to H+ concentration * 106
R GUI image

R GUI image

  1. Convert values to pH using Raster Calculator
R GUI image

R GUI image

  1. The values now correspond to pH
R GUI image

R GUI image

7.2 Circular data

Slope aspect - requires the use of circular statistics for summarizing numerically, or graphical interpretation using circular plots. For example, if soil map units being summarized have a uniform distribution of slope aspects ranging from 335 degrees to 25 degrees, the Zonal Statistics tool in ArcGIS would return a mean of 180.

The most intuitive means available for evaluating and describing slope aspect are circular plots available with the circular package in R and the radial plot option in the TEUI Toolkit. The circular package in R will also calculate circular statistics like mean, median, quartiles etc.

library(circular)

aspect <- s$aspect_field
aspect <- circular(aspect, template="geographic", units="degrees", modulo="2pi")

summary(aspect)
##        n     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.      Rho 
##  59.0000  20.0000 -82.5000 205.0000 209.4000 140.5000  28.0000   0.1765 
##     NA's 
##   4.0000

The numeric output is fine, but a following graphic is more revealing, which shows the dominant Southwest slope aspect.

rose.diag(aspect, bins = 8, col="grey")

Figure 11. Rose Diagram

8 Soil Reports

One of the strengths of NASIS is that it that has a wealth of existing queries and reports. This makes it easy for the average user to load their data and run numerous reports. In R the soilDB similarly has many fetch functions to load data. Recently the soilReports package was developed to replicate NASIS’s report repository. Currently soilReports contains a small collection of reports designed to summarize field and laboratory pedons, and map unit polygons. This collection of reports automates many of methods reviewed thus far, but are no substitute for interacting with your data and asking questions. Examples can be found at the following link, https://github.com/ncss-tech/soil-pit/tree/master/examples.

The soilReports R package is largely just a collection of R Markdown (.Rmd) reports. R Markdown is a document format that makes it easy to create reports and other dynamic documents. It allows R code and text to be mingled in the same document and executed like an R script. This allows R to generate reports similar to NASIS.

Lets get busy and demonstrate how to run an existing .Rmd to summarize a series of pedons for a given soil series.

8.1 Requirements

  • Data are properly populated, otherwise the report may fail. Common examples include:
    • Horizon depths don’t lineup
    • Either the Pedon or Site tables isn’t loaded
  • ODBC connection to NASIS is setup
  • Beware each report has a unique configuration file that may need to be edited.

8.2 Instructions

  1. Load your NASIS selected set. Run a query such as “POINT - Pedon/Site/NCSSlabdata by upedonid and Taxon Name” from the Region 11 report folder to load your selected set. Be sure to target both the site, pedon and lab layer tables. Remove from your selected set the pedons and sites you wish to exclude from the report.

  2. Install/rei-install the soilReports package. This package is updated regularly (e.g. weekly), and should be re-installed from GitHub regularly.

# Install the soilReports package from GitHub
devtools::install_github("ncss-tech/soilReports", dependencies=FALSE, upgrade_dependencies=FALSE)
  1. View the list of available reports.
# Load the soilReports and rmarkdown package
library(soilReports)
library(rmarkdown)

# List reports
listReports()
  1. Copy the lab summary to your working directory.
# Copy report to your directory
reportInit(reportName = "region11/lab_summary_by_taxonname", outputDir = "C:/workspace/lab_sum")
  1. Examine the report folders contents. The reported is titled report.Rmd. Notice their are several other support files. The parameters for the report are contained in the config.R file.
  1. Check or create a genhz_rules file for a soil series. In order to aggregate the pedons by horizon designation, a genhz_rules file (e.g., Miamian_rules.R) is needed. See below.

If none exists see the following job aid on how to create one, Assigning Generalized Horizon Labels.

Pay special attention to how carrot “^” and dollar “$” symbols are used. They function as anti-wildcards. For example, a “^” placed before an A horizon, “^A”, will match any horizon designation that follows A, such as Ap, Ap1, etc. Also, placing a “$” after a Bt horizon, “Bt$”, will match any horizon designation that precedes Bt, such as 2Bt or 3Bt. Encapsulating a horizon with both “^” and “$” symbols will result only in exact matches. For example ^A$, will only match A, not Ap or A1.

  1. Excute the report.

Command-line approach

# Set report parameters
series <- "Miamian"

# report file path
report_path <- "C:/workspace/lab_sum/report.Rmd"

# Run the report
render(input = report_path)

Manual approach

Open the config.R file, enter/replace the soil series of interest, and remove the # comment before series object. Be sure to encase the soil series name in quotations, using either double or single quotations, such as "series" or 'series'.

Open the report.Rmd, and hit the knit button.

  1. Save the report. The report is automatically saved upon creation in the same folder as the R report. However, it is given the same generic name as the R report (i.e., C:/workspace/lab_sum/report.html), and will be overwritten the next time the report is run. Therefore, if you wish to save the report, rename the .html file to a name of your choosing, or convert it to a pdf. Also beware when reopening the .html with Internet Explorer, click on “Allow blocked content” if prompted. Otherwise Internet Explorer may alter the formatting within the document.

Sample pedon report

8.3 Exercise 2: run the lab_summary_by_taxonname.Rmd

  • Load your selected set with the pedon and site table for an existing GHL file, or make your own (highly encouraged)
  • Run the lab_summary_by_taxonname.Rmd report on a soil series of your choice.
  • Show your work and submit the results to your mentor.

9 Zonal Statistics

To summarize spatial data for a polygon, some form of zonal statistics can be used. Zonal statistics is a generic term for statistics that aggregate data for an area or zone (e.g., a set of map unit polygons). This can be accomplished via two methods. The most common method provided by most GIS is the census survey method, which computes a statistical summary of all the raster cells that overlap a polygon or map unit. This approach is generally faster and provides a complete summary of the spatial data. An alternative approach is the sample survey method, which takes a collection of random samples from each polygon or map unit. While the sample approach is generally slower and does not sample every cell that overlaps a polygon it does offer certain advantages. For example the census approach used by most GIS typically only provides basic statistics such as: the min, max, mean, standard deviation, and sum. However, for skewed data sets the mean and standard deviation may be unreliable. A better alternative for skewed data sets is to use non-parametric statistics like quantiles. Examples of non-parametric statistics are covered in Chapter 4. The advantage of the sample approach is that it allows us to utilize alternative statistics, such as quantiles, and perform a more thorough exploratory analysis. While some people might prefer the census approach because it provides a complete summary of all the data that overlaps a map unit, it is important to remember that all spatial data are only approximations of the physical world and therefore are only estimates themselves with varying levels of precision.

Before extracting spatial data for the purpose of spatial prediction, it is necessary that the data meet the following conditions:

  • All data conform to a common projection and datum
  • All raster data have a common cell resolution
  • All raster data are co-registered, that is, the geographic coordinates of cell centers are the same for all layers. Setting the Snap Raster in the ArcGIS Processing Environment prior to the creation of raster derivatives can ensure cell alignment. An ERDAS model is also available to perform this task.

9.1 Zonal Statistics in R

Zonal statistics in R can be implemented using either the census or sample approach. While R can compute zonal statistics using the census approach with the zonal() function in the raster package, it is considerably faster to call another GIS via the RSAGA or spgrass6 packages. These additional GIS packages provide R functions to access SAGA and GRASS commands. For this example the RSAGA package will be used.

# library(rgdal)
# 
# ca794 <- readOGR(dsn = "E:/geodata/project_data/8VIC/ca794", layer = "ca794") # import CA794 map unit polygons
# ca794 <- spTransform(ca794, CRS("+init=epsg:5070")) # reproject
# ca794$mukey2 <- as.integer(as.character(ca794$MUKEY)) # convert MUKEY to  an integer
# writeOGR(ca794, dsn = "C:/workspace", layer = "ca794", driver = "ESRI Shapefile", overwrite_layer = TRUE) # export shapefile
# 
# library(RSAGA)
# myenv <- rsaga.env(path = "C:/Program Files/QGIS Essen/apps/saga") # set rsaga path
# ned <- raster("E:/geodata/project_data/8VIC/sdat/ned30m_8VIC.sdat") # load DEM
# test <- raster(extent(ca794), ext = extent(ned), crs = crs(ned), res = res(ned))  # create a blank raster that matches the DEM
# writeRaster(test, file = "E:/geodata/project_data/8VIC/sdat/ca794.sdat", format = "SAGA", progress = "text", overwrite = TRUE) # export the raster
# 
# # Convert the CA794 shapefile to a rsaga raster
# 
# rsaga.geoprocessor("grid_gridding", 0, env = myenv, list(
#   INPUT = "C:/workspace/ca794.shp",
#   FIELD = "mukey2",
#   OUTPUT = "2",
#   TARGET = "0",
#   GRID_TYPE = "2",
#   USER_GRID = "E:/geodata/project_data/8VIC/sdat/ca794.sgrd",
#   USER_XMIN = extent(test)[1] + 15,
#   USER_XMAX = extent(test)[2] - 15,
#   USER_YMIN = extent(test)[3] + 15,
#   USER_YMAX = extent(test)[4] - 15,
#   USER_SIZE = res(test)[1]
# )
# )
# 
# # Extract the zonal statistics for 2 rasters
# 
# rsaga.geoprocessor("statistics_grid", 5, env = myenv, list(
#   ZONES = "E:/geodata/project_data/8VIC/sdat/ca794.sgrd",
#   STATLIST = paste(c("E:/geodata/project_data/8VIC/sdat/ned30m_8VIC.sgrd", "E:/geodata/project_data/8VIC/sdat/ned30m_8VIC_slope5.sgrd"), collapse = ";"),
#   OUTTAB = "C:/workspace/github/stats_for_soil_survey/trunk/data/ca794_zonal.csv"
# ))
# 
test <- read.csv("C:/workspace2/github/stats_for_soil_survey/trunk/data/ca794_zonal.csv") # import the results
names(test)[1] <- "mukey" # rename the mukey column
subset(test, mukey == 2480977) # examine mukey 2480977
##      mukey Count.UCU ned30m_8VICN ned30m_8VICMIN ned30m_8VICMAX
## 76 2480977    204460       204460       503.1204       1687.588
##    ned30m_8VICMEAN ned30m_8VICSTDDEV ned30m_8VICSUM ned30m_8VIC_slope5N
## 76        942.5077          206.0661      192705133              204460
##    ned30m_8VIC_slope5MIN ned30m_8VIC_slope5MAX ned30m_8VIC_slope5MEAN
## 76              0.195744              94.46768               38.52244
##    ned30m_8VIC_slope5STDDEV ned30m_8VIC_slope5SUM
## 76                 16.73909               7876299

To implement the sample approach to zonal statistics we can use the extract() and over() functions in the raster and sp packages respectively.

# # Take a stratified random sample
# s <- spsample(ca794, n = 1000, type = "stratified")
# 
# setwd("E:/geodata/project_data/8VIC/ca794")
# 
# # Create a raster stack
# rs <- stack(c(
#   elev = "ned30m_8VIC.tif", 
#   slope = "ned30m_8VIC_slope5.tif")
#   )
# # Set the spatial projection
# proj4string(rs) <- CRS("+init=epsg:5070") 
# 
# # Extract the map unit polygon value
# test1 <- over(s, ca794)
#
# # Extract the raster values
# test2 <- data.frame(extract(rs, s))
#
# # Combine the two datasets
# test2 <- cbind(test1, test2)
#
# # Cache/save the results 
# save(test2, file = "C:/workspace/github/stats_for_soil_survey/trunk/data/ch2_sample.Rdata")

# Load the cache/saved results
load(file = "C:/workspace2/github/stats_for_soil_survey/trunk/data/ch2_sample.Rdata")

# Examine summary for mukey 2480977
summary(subset(test2, MUKEY == 2480977)) 
##  AREASYMBOL   SPATIALVER     MUSYM        MUKEY        mukey2       
##  CA794:53   Min.   :2    1255   :53   2480977:53   Min.   :2480977  
##             1st Qu.:2    1220   : 0   1910055: 0   1st Qu.:2480977  
##             Median :2    1225   : 0   1910056: 0   Median :2480977  
##             Mean   :2    1230   : 0   1910058: 0   Mean   :2480977  
##             3rd Qu.:2    1240   : 0   1910059: 0   3rd Qu.:2480977  
##             Max.   :2    1241   : 0   1910060: 0   Max.   :2480977  
##                          (Other): 0   (Other): 0                    
##       elev            slope       
##  Min.   : 566.8   Min.   : 9.731  
##  1st Qu.: 801.2   1st Qu.:26.031  
##  Median : 930.8   Median :42.491  
##  Mean   : 959.7   Mean   :40.773  
##  3rd Qu.:1123.6   3rd Qu.:53.470  
##  Max.   :1467.1   Max.   :71.997  
## 

Compare the results of the census and sample approaches above. While the census approach surveyed 204,460 cells, the sample approach only surveyed 5,777. However we can see that the results are largely similar between the two approaches.

9.1.1 Exercise 3: calculating zonal statistics

  • Use your own data
  • Extract the zonal statistics for one soil survey area or several polygons, using the sample approach
  • Show your work and submit the results to your coach

9.2 TEUI toolkit

The TEUI toolkit works in a similar manner to Zonal Statistics as Table within ArcGIS, with the added benefit of interactive graphics to aid in the assessment. TEUI is an ArcGIS Add-in and may be installed without Administrator privilege. Additional information and downloads are available here:

http://www.fs.fed.us/eng/rsac/programs/teui/downloads.html

One advantage of TEUI is the ability to output tabular data at both the map unit and polygon level with one operation. SECTIONS 2 - 5 of the TEUI User Guide are excerpted below.

SECTION 2. Creating and opening an existing project

The Toolkit requires the user to specify a folder location to store the database that contains the statistics and the file location of the data used to create them. This database can be an existing project folder location, but it is recommended that a new folder be created to reduce the number of extraneous files in a folder with other data.

Create a New Toolkit Project

  1. On the TEUI Toolbar, select the Folder icon. R GUI image

  2. A dialog window will appear.
    R GUI image

  3. Select Browse…

  4. In the Browse for Folder dialog, navigate to a location of your choice and select “Make New Folder”

  5. Type in an appropriate name for your project

  6. Click OK

Open an Existing Toolkit Project

  1. On the TEUI Toolbar, select the Folder icon R GUI image

  2. A dialog window will appear.
    R GUI image

If the project was recent, click on the project in the Recent dialog box and click Open. If it is older and does not appear in the dialog box, select Browse and navigate to the project folder location. Select the project and click OK.

SECTION 3. Manage geospatial data and calculate statistics

Managing project data with version 5.0 is very simple. The user simply points the program to the file location of the data on your computer that is to be used in the analysis.

Note: If you move the geospatial data used in an existing Toolkit project, you will need to re-link the Toolkit to the data when opening that project.

Add Analysis Features to a Toolkit Project

  1. On the TEUI Toolbar, select the Data Manager icon. R GUI image
  2. The Data Manager dialog window will appear.
R GUI image

R GUI image

Analysis Features: are zones within which you would like to generate statistics. These features can be feature classes such polygons or points (line features will be available soon) as a shapefile (.shp), as a feature class within a file geodatabase, or within a feature dataset within a file geodatabase. Analysis features can also be in the form of a discrete raster. The value field will be used as the identifier.

Rasters: are the raster data that are to be used to generate the descriptive statistics. An example would be a digital elevation model (DEM), percent slope, aspect, or land use for a discrete raster.

  1. In the Analysis Features area, click on the Add Layer button.
R GUI image

R GUI image

  1. A data navigation window will appear. Navigate to the analysis layer you wish to use and select Add. You can add as many data layers as you wish to analyze.
R GUI image

R GUI image

Note: All data layers to be analyzed (analysis features and raster data) must have the same geographic coordinate system and projection as each other. You will receive a warning if they are different.

Remove Analysis Features from a Toolkit Project

  1. To remove an analysis layer from the Data Manager, click on the layer you wish to remove in the grey layer list, and click Remove. Click OK when prompted that this is correct.

R GUI image
R GUI image

Add Analysis Features to the ArcMap Table of Contents

  1. To add the selected layer to the ArcMap table of contents, select the layer you wish to add by clicking on it in the grey layer list, and click the Add to Table of Contents button.
R GUI image

R GUI image

Choose a Map Unit

In some instances, such as TEUI mapping or Soil Survey, you may wish to identify individual polygon features as belonging to a specific map unit. This is simply a repeating identifying symbol or value which will be used to aggregate the statistics of the polygons belonging to that group. You do not need to select a map unit column in order to run the Toolkit. The symbol must be present as an attribute column in either the feature class layer or as an attribute in the VAT table of a discrete raster.

  1. Select the Analysis Feature layer that you wish to add a Map Unit symbol for.

  2. Select the appropriate map unit column attribute name for your map unit symbol in the drop down menu

R GUI image

R GUI image

Add Raster Data to the Toolkit Project

  1. To add raster data to your Toolkit project, select the Add Layer button under the Raster heading
    R GUI image

  2. In the Browse to Raster file location dialog window, navigate to the file location of the raster data you wish to add to the Toolkit project.

R GUI image

R GUI image

  1. Select the layer you wish to add and select the Add button.

Add Raster Data to the ArcMap Table of Contents

  1. To add raster data to your ArcMap Table of Contents, first select the layer or layers you wish to add by clicking on them in the grey layer list under the Rasters section.
R GUI image

R GUI image

  1. Click the Add to Table of Contents button.

  2. To add multiple raster data layers at once, hold down the control button while selecting the individual layers you wish to add.

Calculate Statistics

  1. To calculate statistics, you must first select the analysis feature layers and raster data layers you wish to run statistics on by checking the box next to each layer.
R GUI image

R GUI image

  1. Click on the Generate Stats button. Select Yes when prompted by the time warning that appears. This process may take a significant amount of time depending on the amount of data selected to be run.
R GUI image

R GUI image

  1. A dialog will appear which informs the user of the duration, features or rasters being used, and the progress. The tool can calculate more than 1 million cells a second.
R GUI image

R GUI image

  1. When the statistics run is finished, click OK.

SECTION 4. Create and Visualize Graphs and Tables

A primary Toolkit feature is the ability to view the generated statistics in both a graph and table format. The Toolkit creates zonal statistics for the analysis features and raster data selected in the data manager. The results can be visualized in graph form or as summary table both by individual feature or by map unit if one was chosen.

Open a graph window

  1. To create a graph, click on the graph button on the Toolkit Toolbar.
R GUI image

R GUI image

  1. A blank graph window will appear. You can open as many individual graph windows as you want.
R GUI image

R GUI image

  1. Click on the Add Series Tab in the upper left corner of the graph window.
R GUI image

R GUI image

  1. The Add Series window pane will open up. You can pin the window open by clicking on the sideways pin in the upper right hand of the Add Series window pane or close it by clicking on the vertical pin.
R GUI image

R GUI image

  1. In a step wise fashion, select… MUPComparison: This is the default option. This allows the greatest flexibility when comparing individual polygons against other individual polygons, individual polygons against map units, or map units against map units (if chosen).

MU Comparison: This data view type contains unique default graphs when comparing map units or other aggregated statistics types. These graphs have the mean (solid line), range (darker colored area), and standard deviation (lighter colored area) on the same graph, for each polygon within a map unit.

R GUI image

R GUI image

Chart type: You can choose the type of graph that best represents your data. The default is the area chart type.

Area chart
R GUI image

Bar chart

R GUI image

R GUI image

Point chart

R GUI image

R GUI image

Line chart

R GUI image

R GUI image

Radial chart

R GUI image

R GUI image

Feature Source: Choose which feature analysis layer you would like to graph. Polygons or discrete raster

R GUI image

R GUI image

Map Unit: Chose from which map unit you would like to select a polygon. Alternatively, select just the map unit you want to graph as a whole (i.e., don’t select an individual feature below). Also, you can leave this option blank which is the default.

R GUI image

R GUI image

Feature: Select the individual feature you would like to view statistics for. This is the Feature ID or FID.

R GUI image

R GUI image

Raster Source: Select the raster layer you wish to view the statistics from.

R GUI image

R GUI image

Raster Band: Select the raster band from the raster layer.

R GUI image

R GUI image

Set Graph: Click once finished.

R GUI image

R GUI image

Common tasks with graphs

How do I add more individual features or map units to same graph? Within the Add Series pane, simply select a new feature source (if desired), map unit, feature, or raster source. Click Set Graph and the new feature or map unit will be added to the graph and the legend in a new color.

R GUI image

R GUI image

Can I add multiple axes to the graph? For example, a graph of elevation and percent slope?
To add another axis to your graph window, simply select the feature source, a map unit if desired, or a feature if desired, but select a new raster data source. You will notice in the graph legend map units and features are separated by the raster layer.

R GUI image

R GUI image

Is there a way to find a specific value on the graph?
If you hover over a specific spot in the graph, a line and window will appear with the specific values that your cursor is on. This is valuable if you are trying to identify outliers in your data.

R GUI image

R GUI image

Can I normalize the data ranges?
To normalize data so that the data ranges are comparable within the graph window, check the normalize button at the top of the graph.

R GUI image

R GUI image

How do I remove a graph series from the graph window?
Simply hit the red X next to the graph series you would like to remove in the graph legend.

R GUI image

R GUI image

How do I view the tabular data associated with each map unit or feature in the graph window?
To view the tabular and descriptive statistics of each map unit or feature in the graph window, simply select the table icon next to the data series in the graph legend.

R GUI image

R GUI image

SECTION 5. Exporting Statistics Data to Excel

The statistical data summaries produced by the toolkit can be exported to excel as a .CSV file for further use.

Export individual feature statistics

R GUI image

R GUI image

  1. On the main menu bar, click the export table button with the green arrow. Note this may take a little while as the program is gathering up the data.

  2. Navigate to the location you would like to save the .CSV file. Give the file a name and click Save.

R GUI image

R GUI image

Export map unit feature statistics

R GUI image

R GUI image

  1. On the main Toolkit menu bar, click the export map unit table button with the blue arrow. Note this may take a little while as the program is gathering up the data.

  2. Navigate to the location you would like to save the .CSV file. Give the file a name and click Save.

9.3 ArcGIS tools

Gathering statistics of raster data cells for polygon data sets like SSURGO is typically achieved by the use of the Zonal Statistics as Table function from the Spatial Analyst toolbox. The output will be a tabular summary for the specified Zone, usually map unit symbol or individual polygons.

Example for summarizing by Map Unit Symbol:

Open the Zonal Statistics as Table Tool in the Zonal Toolbox.

R GUI image

R GUI image

The input zone field is MUSYM and the input raster is slope:

R GUI image

R GUI image

Which produces a table with the following output:

R GUI image

R GUI image

The use of the Mean and Standard Deviation are acceptable, provided the distributions are close to normal and the interest is in summarizing the entire population of map units. To get a closer look at individual polygons, summarize using the FID:

R GUI image

R GUI image

Which produces a table with the following output:

R GUI image

R GUI image

Use the Join capability to associate the table of statistics to the spatial data:

R GUI image

R GUI image

R GUI image

R GUI image

Which lets you view results by polygon and search for outliers or polygons that require further investigation:

R GUI image

R GUI image

In this example, 4 polygons of ChC2, Coshocton silt loam, 6 to 15 percent slopes, eroded, have an average slope greater than 15 percent. Discrepancies like this will need to be investigated and resolved.

R GUI image

R GUI image

In another example using a Box plot for assessment of a map unit with a slope class of 15 to 25 percent slopes, indicates half of the polygons with an average slope less than 15 percent:

R GUI image

R GUI image

10 References

FAO Corporate Document Repository. http://www.fao.org/docrep/field/003/AC175E/AC175E07.htm

Lane, D.M. Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/ Project Leader: David M. Lane, Rice University

Seltman, H. 2009. Experimental Design and Analysis. Chapter 4: Exploratory Data Analysis. Carnegie Mellon University. http://www.stat.cmu.edu/~hseltman/309/Book/

Vaughan, R., & Megown, K., 2015. The Terrestrial Ecological Unit Inventory (TEUI) Geospatial Toolkit: user guide v5.2. RSAC-10117-MAN1. Salt Lake City, UT: U.S. Department of Agriculture, Forest Service, Remote Sensing Applications Center. 40 p. http://www.fs.fed.us/eng/rsac/programs/teui/about.html

Tukey, John. 1977. Exploratory Data Analysis, Addison-Wesley

Tukey, J. 1980. We need both exploratory and confirmatory. The American Statistician, 34:1, 23-25 Webster, R. 2001. Statistics to support soil research and their presentation. European Journal of Soil Science. 52:331-340. http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2389.2001.00383.x/abstract

11 Additional Reading

Peng, R. D., 2016. Exploratory Data Analysis with R. Leanpub. https://bookdown.org/rdpeng/exdata/

Wickham, H., and G. Grolemund, 2017, 7 - Exploratory Data Analysis, pp: 81-109. In H. Wickham and G. Grolemund. R for Data Science. O’Reilly Media Inc., Sebastopol, CA. http://r4ds.had.co.nz/exploratory-data-analysis.html

Healy, K., 2018. Data Visualization: a practical introduction. Princeton University Press. http://socviz.co/

Helsel, D.R., and R.M. Hirsch, 2002. Statistical Methods in Water Resources Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. 522 pages. http://pubs.usgs.gov/twri/twri4a3/